ABSTRACT

mirConnX is a user-friendly web interface for infer- 
ring, displaying and parsing mRNA and microRNA 
(miRNA) gene regulatory networks. mirConnX 
combines sequence information with gene expres- 
sion data analysis to create a disease-specific, 
genome-wide regulatory network. A prior, static 
network has been constructed for all human and 
mouse genes. It consists of computationally pre- 
dicted transcription factor (TF)-gene associations 
and miRNA target predictions. The prior network is 
supplemented with known interactions from the lit- 
erature. Dynamic TF- and miRNA-gene associations 
are inferred from user-provided expression data 
using an association measure of choice. The static 
and dynamic networks are then combined using an 
integration function with user-specified weights. 
Visualization of the network and subsequent
analysis are provided via a responsive graphical
user interface. Two organisms are currently
supported: Homo sapiens and Mus musculus. The
intuitive user interface and large database
make mirConnX a useful tool for clinical scientists
for hypothesis generation and exploration.
mirConnX is freely available for academic use at 
http://www.benoslab.pitt.edu/mirconnx. 

INTRODUCTION 

Since their discovery two decades ago, it has become
increasingly clear that microRNAs (miRNAs) play a crucial role
in modulating gene expression at the post-transcriptional
level. These small, ~22-nt-long RNA molecules fine-tune gene
expression by base pairing to target messenger RNAs,
resulting in their degradation or in translational
repression. As Pandit et al. (1) have shown, deregulation of
even a single miRNA may cause complex human diseases.
Regulatory network reconstruction methods have trad- 
itionally involved transcriptional regulation only. 
Incorporating miRNAs thus becomes the next natural 
step. Only a few tools have explored ways to associate
mRNA and miRNA expression to infer regulatory interactions.
MMIA (2) and MAGIA (3), for example, utilize associ- 
ation metrics such as correlation and mutual information. 
In a different context, Huang et al. (4) employed a 
Bayesian model to identify miRNA targets from sequence 
features and expression data. However, there are several 
limitations to these tools. MMIA examines only the subset
of miRNAs that are significantly up- or downregulated.
It therefore omits miRNAs that may be significantly
correlated with their targets but do not pass the chosen
threshold for differential expression. This also restricts
the input to data with a control/disease contrast,
excluding possible use of time-series data.
GenMir++ (4) is a more sophisticated algorithm, but it 
becomes computationally inefficient when a large number 
of genes is considered. Furthermore, it does not take into 
account other supporting information such as the tran- 
scriptional regulation. In fact, none of these tools incorp- 
orates the full set of transcription factors (TFs) in the 
global network construction. Additionally, network motifs 
such as feed-back and feed-forward loops that are known 
to have an important role in cancer development and 
other diseases are usually not identified as part of the 
routine analyses of the currently available tools. 

To address some of these concerns, we developed mirConnX
(http://www.benoslab.pitt.edu/mirconnx). mirConnX takes
advantage of prior knowledge (from sequence data) and
incorporates evidence from gene expression data to create
condition-specific, genome-wide regulatory networks.
mirConnX also aims to identify gene network motifs,
involving transcription factors and miRNAs, that are
associated with the corresponding disease, pathogenesis or
phenotype of interest.

TOOL DESCRIPTION 

General framework 

mirConnX aims to provide an integrated environment 
that allows the user to infer genome-wide transcriptional 
(TF-gene/miRNA) and post-transcriptional (miRNA- 
gene/TF) regulatory networks for a particular disease or 
condition. We consider mRNA and miRNA expression 
data measured under the same set of conditions, or at 
the same time points, or from the same corresponding 
diseased or normal samples (matching samples). The 
mRNA and miRNA expression data are pre-processed 
to remove genes that are lowly expressed with limited 
variance overall. Then, we connect TFs and miRNAs to 
genes using a statistical association measure. The associ- 
ation network that is constructed reflects the disease status 
or the condition of interest. This network is an undirected 
graph, in which an edge exists between two nodes (genes) 
if an interaction has been detected. Note that such associ- 
ation networks do not discriminate between direct and 
indirect interactions. This network is then superimposed
on a pre-compiled, species-specific prior network, which is
derived from TF motif scanning and binding, miRNA
target predictions and literature evidence. The prior
network is a directed, weighted graph, in which an edge
between a TF or miRNA and a gene exists if the former is
predicted to regulate the latter. All connections in the
prior network correspond to direct, predicted or verified
interactions. Superimposing the two networks via an
integration function results in a directed network, which is
expected to contain significantly fewer indirect
interactions (depending on the weight the user assigns to the
prior network). The mirConnX web tool allows easy
visualization and exploration of the network, and identifies
network motifs. In the following sections, we describe
the construction of the context-dependent (dynamic) asso- 
ciation network, the construction of the prior (static) 
network and their integration. Figure 1 presents an 
overview of the mirConnX pipeline. 

Input formats 

mirConnX accepts normalized mRNA and miRNA ex- 
pression data in tab-delimited files, where the first row con- 
tains sample IDs and the first column contains mRNA or 
miRNA IDs. mirConnX supports gene symbols, Ensembl 
Gene ID, Ensembl Transcript ID, Entrez Gene ID, RefSeq 
DNA ID and Unigene ID as mRNA identifiers and 
miRBase miRNA ID and Accession numbers as miRNA 
identifiers. An example of matching mRNA-miRNA
data sets is available on the front page and can be pre-loaded.
Note that the sample IDs for mRNA and miRNA data 
should match. Any unmatched samples are discarded. 
mirConnX allows multiple columns with the same header 
in case of biological or technical replicates. The input data 
sets are stored only during a user's session and are used to 
construct the association network. If no miRNA data file is
included, the resulting network will show only TF-gene
interactions. We currently support two organisms: human
(Homo sapiens) and mouse (Mus musculus), as genome
annotation and prior information are most abundant for these
species.
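
The input convention described above can be illustrated with a short sketch. It is not the mirConnX implementation: the function names `read_expression` and `keep_matched_samples` and the toy values are assumptions made for illustration only. The sketch reads a tab-delimited matrix (sample IDs in the first row, mRNA or miRNA IDs in the first column) and discards samples that do not occur in both files.

```python
import csv
import io


def read_expression(text):
    """Parse a tab-delimited expression matrix.

    Returns (sample_ids, rows), where rows maps each mRNA/miRNA ID
    to its list of expression values (one per sample column).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # the first cell only labels the ID column
    rows = {}
    for record in reader:
        if record:
            rows[record[0]] = [float(v) for v in record[1:]]
    return sample_ids, rows


def keep_matched_samples(mrna, mirna):
    """Keep only the columns whose sample ID occurs in both matrices."""
    (m_ids, m_rows), (r_ids, r_rows) = mrna, mirna
    shared = set(m_ids) & set(r_ids)

    def subset(ids, rows):
        cols = [i for i, s in enumerate(ids) if s in shared]
        kept_ids = [ids[i] for i in cols]
        kept_rows = {g: [v[i] for i in cols] for g, v in rows.items()}
        return kept_ids, kept_rows

    return subset(m_ids, m_rows), subset(r_ids, r_rows)


# Toy input (made-up values): S1 and S4 are unmatched and are discarded.
MRNA = "gene\tS1\tS2\tS3\nTP53\t7.1\t6.8\t7.4\nSOX2\t5.0\t5.5\t4.9\n"
MIRNA = "mirna\tS2\tS3\tS4\nhsa-miR-21\t9.2\t9.9\t8.7\n"
mrna, mirna = keep_matched_samples(read_expression(MRNA), read_expression(MIRNA))
print(mrna[0], mirna[0])
```

Columns that share a header (replicates) are all retained by this sketch, in line with the statement that repeated headers are allowed.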

Constructing the prior network from sequence and 
literature data 

The prior network is constructed by combining all predic- 
tions of TF to gene and TF to miRNA interactions and all 
miRNA target predictions. The network is then enhanced 
by literature evidence that confirms the existence of an 
edge. This results in a directed network that represents 
the collection of prior knowledge on regulatory potentials 
between genes. 

TF to gene/miRNA regulations. We define the binding
potential ($Rg_{TF}$) of a promoter sequence for a given gene/
miRNA as the maximum of the literature evidence score
($S_{LIT} \in \{0, 1\}$) and the TF binding score ($S_{TF}$), which is
calculated using a sliding window method (5) on the
promoters of genes and miRNAs. The JASPAR (6) and
TRANSFAC (7) position weight matrices (PWMs) are
used for the scanning. A subsequence is considered a
binding site for a TF if its PWM score is in the top 1% of
all scores for this PWM. In addition, UCSC Regulation
Track Conserved TFBS Scores ($S_{CONS}$) are added to
enhance confidence. The sum of $S_{TF}$ and $S_{CONS}$ is
normalized to a score between 0 and 1. Finally, if an
experimentally verified binding motif for a given TF is
available for this promoter (e.g. in TRANSFAC), then $S_{LIT}$
becomes 1, and so does $Rg_{TF}$.

$$Rg_{TF} = \max\{(|S_{TF}| + |S_{CONS}|),\ S_{LIT}\}$$

Regular gene promoters were defined as the region 5 kb
upstream of the TSS. Promoters were obtained from the Database
of Transcription Start Sites (DBTSS) (8), the Eukaryotic
Promoter Database (EPD) (9) and the UCSC genome browser
Regulation-Transcription track (Eponine and SwitchGear
TSS). miRNA TSSs were defined using a combination of
predictions and experiments from CoreBoost_HM (10),
Marson et al. (11) and Corcoran et al. (12). Human
(NCBI36/hg18) and mouse (NCBI37/mm9) sequence
data were downloaded from the UCSC genome browser (13).
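
The TF binding potential defined above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the text does not say how the sum of the TF binding and conservation scores is normalized, so the sketch divides by a caller-supplied constant (`max_sum`, hypothetical). The helper `top_fraction_cutoff` and all numeric values are likewise made up for illustration.

```python
def top_fraction_cutoff(scores, fraction=0.01):
    """Smallest score that still lies in the top `fraction` of all scores.

    Mirrors the rule that a window counts as a binding site when its
    PWM score is in the top 1% of all scores for that PWM.
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


def tf_binding_potential(s_tf, s_cons, s_lit, max_sum):
    """Rg_TF = max(normalized(|S_TF| + |S_CONS|), S_LIT).

    ASSUMPTION: normalization is a division by `max_sum`, the largest
    attainable sum; the paper only states the result lies in [0, 1].
    """
    if s_lit not in (0, 1):
        raise ValueError("S_LIT must be 0 or 1")
    normalized = (abs(s_tf) + abs(s_cons)) / max_sum
    normalized = min(max(normalized, 0.0), 1.0)
    return max(normalized, float(s_lit))


# Made-up scores: 200 sliding-window scores for one PWM.
cutoff = top_fraction_cutoff(list(range(1, 201)))
predicted_only = tf_binding_potential(6.0, 2.0, 0, max_sum=10.0)
verified = tf_binding_potential(6.0, 2.0, 1, max_sum=10.0)
print(cutoff, predicted_only, verified)
```

A literature-verified site always yields a potential of 1, regardless of the predicted scores, which is the behavior the maximum in the formula encodes.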

Figure 1. Overview of the integrated analysis in mirConnX. Visualization is achieved with the use of Cytoscape Web v0.7.2 (25). Feed-forward loops
are displayed on a separate tab. Links to external databases are provided for every coding gene or miRNA. Statistical analysis on GO pathway (45)
overrepresentation in a user-selected set of genes will be added soon.

Nucleic Acids Research, 2011, Vol. 39, Web Server issue W419

miRNA to gene/TF regulations. miRNA target prediction
algorithms generally do not agree very well. Thus, we used
a combination of five target prediction algorithms that
take into account the seed sequence, flanking sequences
and context, binding energy and conservation. These
algorithms are: PITA (14), miRanda (15), TargetScan 5.0
(16), RNAhybrid (17) and PicTar (18). We define the
regulatory potential ($Rg_{miR}$) of an miRNA for a gene as the
proportion of the target prediction algorithms predicting
the gene to contain at least one miRNA target site.
If predictions for the corresponding genome versions were
not available, we ran the algorithms using default
parameters and cutoffs. In addition, if the 3'-UTR of a gene
contains an experimentally verified site from TarBase
(19) or miRecords (20), then the $S_{TarBase}$ or $S_{miRecords}$
score becomes 1 (otherwise it is 0), and so does the
regulatory potential of the given miRNA for the gene:

$$Rg_{miR} = \max\left\{\frac{\text{number of algorithms predicting a target site}}{5},\ S_{TarBase},\ S_{miRecords}\right\}$$

Human and mouse 3'-UTRs were downloaded from the
UCSC genome browser. The lists of mature and complement
miRNAs, as well as their sequences, were obtained
from miRBase v. 14 (21).
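
As a worked illustration of the miRNA regulatory potential, the sketch below computes the proportion of agreeing algorithms and applies the override for experimentally verified sites. The function name and the example inputs are hypothetical; only the five algorithm names and the rule itself come from the text.

```python
ALGORITHMS = ("PITA", "miRanda", "TargetScan", "RNAhybrid", "PicTar")


def mirna_regulatory_potential(predicted_by, in_tarbase=False, in_mirecords=False):
    """Rg_miR: proportion of the five algorithms that predict a site,
    raised to 1 when TarBase or miRecords holds a verified site."""
    predicted = set(predicted_by)
    unknown = predicted - set(ALGORITHMS)
    if unknown:
        raise ValueError(f"unknown algorithm(s): {sorted(unknown)}")
    proportion = len(predicted) / len(ALGORITHMS)
    return max(proportion, float(in_tarbase), float(in_mirecords))


# Hypothetical miRNA-gene pairs.
three_of_five = mirna_regulatory_potential({"PITA", "TargetScan", "PicTar"})
verified_pair = mirna_regulatory_potential({"PITA"}, in_tarbase=True)
print(three_of_five, verified_pair)
```

Because the score is a proportion, it can only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1, so disagreement among the algorithms translates directly into a lower prior weight.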

Gene expression pre-processing 

Standard gene symbols and miRNA IDs are used as our
primary identifiers. Genes and miRNAs with multiple
probes on the array, or those converted to the same
gene symbol/miRNA ID, are collapsed into a single
median value. The normalized mRNA and miRNA
expression data are pre-processed using three filters for low
(i) absolute expression, (ii) variance and (iii) entropy. A
cutoff of 5% is used for mRNA and miRNA expression
data individually to remove data that are not likely to be
important for the network. A list of the genes filtered and
excluded from the analysis is available for the user to
download. Finally, all matching conditions or samples between
the mRNA and miRNA data matrices are retained for
analysis. In case of multiple replicates for the same
condition, the median value between replicates is used.
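
The pre-processing steps can be sketched as follows. This is a minimal illustration, assuming that the 5% cutoff means removing the lowest-scoring 5% of rows for each filter; the text does not spell this out, nor does it specify how entropy is estimated, so only the expression and variance filters are shown. The names `collapse_replicates` and `drop_lowest` and the toy matrix are hypothetical.

```python
from statistics import mean, median, pvariance


def collapse_replicates(sample_ids, values):
    """Median across columns that share a sample ID (first-seen order)."""
    grouped = {}
    for sid, value in zip(sample_ids, values):
        grouped.setdefault(sid, []).append(value)
    return list(grouped), [median(vs) for vs in grouped.values()]


def drop_lowest(rows, score, fraction=0.05):
    """Remove the `fraction` of rows with the smallest score.

    Returns (kept_rows, removed_ids) so the filtered genes can be reported.
    """
    ranked = sorted(rows, key=lambda gene: score(rows[gene]))
    removed = ranked[: int(len(ranked) * fraction)]
    removed_set = set(removed)
    kept = {g: v for g, v in rows.items() if g not in removed_set}
    return kept, removed


# Replicates: the two S1 columns collapse to their median.
ids, collapsed = collapse_replicates(["S1", "S1", "S2"], [4.0, 6.0, 5.0])

# Toy matrix (made-up values) with 20 genes; gene g0 is lowest and flat.
rows = {f"g{i}": [1.0 * i, 2.0 * i, 3.0 * i] for i in range(20)}
after_expr, low_expr = drop_lowest(rows, mean)
after_var, low_var = drop_lowest(after_expr, pvariance)
print(ids, collapsed, low_expr, low_var)
```

An entropy filter could be passed to `drop_lowest` in the same way once an estimator has been chosen.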

Constructing the association network from gene 
expression data 

We construct an association network from the user-supplied
expression data by measuring the strength of all
pair-wise interactions between TFs, miRNAs and genes
across the samples/replicates. A number of parametric
and non-parametric association metrics are available to
the user for defining these interactions. The correlation
coefficient is one of the most intuitive and widely accepted.
The different flavors of correlation (Pearson, Spearman
and Kendall) have been used in the past with varying
degrees of success, for example in the
WGCNA R package (22). The Pearson correlation coefficient
is often used when a linear dependence between the
variables exists. In contrast, the Spearman ρ correlation
coefficient applies the Pearson formula to the ranks of the
values of the two variables and can detect an association
even if it is non-linear (but monotonic). The Kendall τ
rank correlation coefficient also operates on the ranks, but
it is based on the probability of concordance or discordance
of any pair of observations. In general, Spearman and
Kendall give similar results, but they differ in
magnitude [for more details on correlation measures, see Ref.
(23)]. We implemented these three correlation measures
and applied them to pairs of gene, TF or miRNA expression
values across matching conditions. The absolute
magnitude reflects the level of correlation, and the sign
suggests positive or negative interaction. Mutual information
is a non-parametric measure that has been implemented in
algorithms such as ARACNE (24) as the measure of
association for genome-wide pairwise interactions. Mutual
information is non-negative, and as such it does not
provide information about the sign of the interaction.
Furthermore, it is generally computationally intensive
and sensitive to sample size, since it requires an estimation
of the marginal and joint probabilities of the variables. For all
these reasons, mutual information is not currently
implemented in mirConnX. The degree of association,
$P_{assoc}$, is defined as the probability that two genes
are correlated. We used the complement of the significance
of the correlation coefficient ($1 - p$) as the probability of
non-random association. The use of significance, instead of
the coefficient itself, takes the sample size into account and
allows a fair comparison between networks generated from
data sets of different sizes.
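
To make the definition of the degree of association concrete, the sketch below computes a Pearson coefficient and converts it to 1 - p. The text does not state how the significance p is obtained; here it is approximated with the Fisher z-transformation and a normal distribution, which is an assumption chosen so that the example needs only the Python standard library. The function names and the toy vectors are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def association(x, y):
    """Return (r, P_assoc), with P_assoc = 1 - p.

    ASSUMPTION: p is a two-sided P-value from the Fisher z-transformation
    (normal approximation); the paper does not name the test it uses.
    """
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 samples for the approximation")
    r = max(-1.0, min(1.0, pearson(x, y)))
    if abs(r) == 1.0:
        return r, 1.0
    z = math.atanh(r) * math.sqrt(n - 3)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return r, 1.0 - p


# Same correlation (r = 0.8), different sample sizes (made-up values).
r_small, p_small = association([1, 2, 3, 4], [1, 3, 2, 4])
r_large, p_large = association([1, 2, 3, 4] * 2, [1, 3, 2, 4] * 2)
print(round(r_small, 3), round(p_small, 3), round(p_large, 3))
```

Both calls give the same coefficient, but the larger sample yields a higher degree of association, which is the sample-size effect that motivates using significance rather than the coefficient itself.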

Network integration 

Integration of the prior confidence of association (based
on sequence data, literature evidence and predictions) and
the correlation network (based on the gene expression
data) is currently done via a simple weighted sum
function ($S$).

$$S = \gamma_{prior}\,(Rg_{TF}, Rg_{miR}) + \gamma_{assoc}\,P_{assoc}$$

In this equation, $\gamma_{prior}$ is a user-defined parameter
between 0 and 1, and $\gamma_{assoc} = 1 - \gamma_{prior}$. The prior
term is $Rg_{TF}$ or $Rg_{miR}$, depending on whether the
regulator is a TF or an miRNA. The default for
$\gamma_{prior}$ is 0.3 (i.e. 30%) and a value <0.5 is recommended
for the prior information. The user can also define a cutoff
for the combined regulation score, $S$. This is also a
number between 0.0 and 1.0. The higher the cutoff, the fewer
connections will be reported. A value of 0.7-0.99 is
recommended. Finally, for practical purposes, we cap the
number of interactions to be displayed on screen at
3000, as beyond that the network becomes too large to
be efficiently visualized.
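
The weighted sum and the display cutoff can be sketched as below. This is an illustration only: the function name `integrate`, the node names and the scores are made up, and two details are assumptions because the text does not specify them, namely that an edge is kept when its score is at least the cutoff and that the 3000-edge cap retains the highest-scoring edges.

```python
def integrate(edges, gamma_prior=0.3, cutoff=0.9, max_edges=3000):
    """Score edges as S = gamma_prior * Rg + (1 - gamma_prior) * P_assoc.

    `edges` holds (regulator, target, rg_prior, p_assoc) tuples, where
    rg_prior is Rg_TF or Rg_miR depending on the regulator type.
    ASSUMPTIONS: the cutoff is inclusive and the cap keeps the top edges.
    """
    if not 0.0 <= gamma_prior <= 1.0:
        raise ValueError("gamma_prior must lie between 0 and 1")
    gamma_assoc = 1.0 - gamma_prior
    scored = [
        (reg, tgt, gamma_prior * rg + gamma_assoc * pa)
        for reg, tgt, rg, pa in edges
    ]
    kept = sorted((e for e in scored if e[2] >= cutoff),
                  key=lambda e: e[2], reverse=True)
    return kept[:max_edges]


# Hypothetical edges with made-up scores.
edges = [
    ("miR_a", "gene_1", 1.0, 0.98),  # verified site, strong association
    ("miR_a", "gene_2", 0.2, 0.99),  # weak prior, strong association
    ("TF_b", "gene_3", 0.0, 0.95),   # no prior support at all
]
result = integrate(edges, gamma_prior=0.3, cutoff=0.7)
print(result)
```

With the default prior weight of 0.3, an edge without any prior support can score at most 0.7, so any cutoff above 0.7 reports only edges that are backed by at least some sequence or literature evidence.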

Submission and wait time 

Depending on the size of the files (number of genes
analyzed) and the types of analysis chosen, the analysis could take
anywhere from minutes to up to an hour. As an example,
for 20 000 genes and 500 miRNAs, the computing time is
roughly 15 min using Pearson correlation. While the job is
running, an execution log is displayed. The user can
close the browser window. When the job finishes, the user
will receive an email notification and can retrieve the results
from the link provided.

mirConnX output 

Following the link to the result, a visualization of the
network is displayed, as shown in Figure 2. Cytoscape
Web v0.7.2 (25) is used for network display. The
rendering time for a network with 1500 nodes and 2500
connections is about 15 s. Once loaded, browsing the various
areas of the network is instantaneous. Users can use the
tools at the bottom right corner to zoom in/out and edit
node placements on the visualization page, and output the
visualization as graphics. The list of interactions is also
displayed with links to external databases such as miRBase
and Entrez Gene (26) for annotation, PubFocus (27),
EBIMed (28), miR2Disease (29) and miRo (30) to
facilitate clinical research by sifting through a large body of
literature and records, as well as Gene Ontology (31) terms
for each gene. We also make available for download:

(i) a list of interactions above the user-defined display cutoff,
ranked by regulatory score, in a tab-delimited text file;

(ii) a list of nodes, ranked by degree centrality, in a
tab-delimited text file; and (iii) the network in PDF or
GraphML graph formats compatible with Cytoscape
for further exploration. The user can (iv) search for a
particular node and its targets/regulators (through the 'List of
gene interactions: filtered' drop down menu) or for a
particular set of interactions, and highlight or select the
corresponding nodes and edges on the graph display. Finally, we
display all (v) feed-forward loops and their neighbors at
the given threshold. In addition, a summary of statistics,
including the actual number of TFs, miRNAs and genes,
can be retrieved under 'execution log'.

Figure 2. Snapshot of the glioblastoma case study output. An example search for the downstream targets of miR-21, a key player in glioblastoma
development, is shown in the middle.
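
The text does not describe how feed-forward loops are found, so the following is only a generic sketch of one way to enumerate them in a directed edge list: a loop is a triple (a, b, c) in which a regulates b, b regulates c, and a also regulates c directly. The function name and the node names are hypothetical.

```python
def feed_forward_loops(edges):
    """Enumerate triples (a, b, c) with edges a->b, b->c and a->c.

    `edges` is an iterable of directed (regulator, target) pairs.
    """
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)

    loops = set()
    for a, a_targets in targets.items():
        for b in a_targets:
            if b == a:
                continue
            for c in targets.get(b, ()):
                # c must be a third node that a also regulates directly
                if c != a and c != b and c in a_targets:
                    loops.add((a, b, c))
    return sorted(loops)


# Hypothetical network: a TF regulates an miRNA and both regulate gene_1.
network = [
    ("TF_1", "miR_a"),
    ("miR_a", "gene_1"),
    ("TF_1", "gene_1"),
    ("TF_1", "gene_2"),
]
loops = feed_forward_loops(network)
print(loops)
```

In this toy network only the TF, the miRNA and gene_1 form a loop; gene_2 has a single regulator and is therefore not part of any motif.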



CASE STUDY 

The main idea behind mirConnX was first used to analyze
lung epithelial gene expression data a few years ago (1,12).
In that study, we were able to identify a feed-forward loop
that included SMAD TFs, let-7d and the HMGA2 gene,
which was central in the regulation of epithelial to
mesenchymal transition (EMT). Furthermore, we later found
that knocking down let-7d in the trachea of mice can
cause lung fibrosis a few days later (1).

Here, we present a case study that demonstrates the
utility of mirConnX. We downloaded a set of publicly
available mRNA and miRNA expression profiles from
The Cancer Genome Atlas (TCGA) pilot project
(http://cancergenome.nih.gov/), where a large compendium of
tumor and normal glioblastoma multiforme (GBM;
primary brain tumor) expression data is available. The
reason for choosing this disease is two-fold. First, in this
repository, GBM is one of the two diseases with both tumor and
normal samples. Second, recent studies have revealed
distinct patterns of miRNA expression in tumor
compared to normal brain (32) and several miRNA
targets have in fact been experimentally verified (33-35).
The disease samples are characterized by rapid
proliferation and stem cell-like behavior that is possibly caused by
malfunctioning of characteristic pathways (36). Mutations
in miRNAs and miRNA targets have been postulated to be
involved in tumorigenesis, but have not been specifically
identified in GBM.

The expression profiles downloaded consist of a total of
58 matched mRNA and miRNA samples from the Agilent
244k aCGH platform at data level 3. We used the
following parameters in mirConnX: Gene Symbols, miRBase
ID, Pearson correlation with a prior weight of 0.3 and
0.9 as the display cutoff threshold. A total of 56
miRNAs, 29 TFs and 1180 genes form a network with a
total of 1851 connections. Of these interactions, 43 are
miR-TF regulations, 34 are TF-gene connections and 1774
are miRNA-gene connections.

Among the top interactions, we were able to identify
four hubs: miR-21, miR-326, miR-34 and miR-137,
which have been verified to be miRNAs involved in
glioblastoma. These four miRNAs are also hubs with
some of the highest degree centrality (37), sharing many
targets and TFs with other hubs. Among them, miR-21
has been found to be one of the most highly expressed
miRNAs in many cancer types, and it has been shown
that miR-21 acts as an oncogene in glioblastoma by
suppressing apoptosis (38). Among the highest ranking
targets we predict for miR-21, SOX2 (39) and the TGFB
pathway (40) have been shown to be regulated by the
miRNA. RECK and PDCD4 have been experimentally
verified, in vivo and in vitro, to be involved in proliferation
(34,41). In addition, PELI1 and CDC25A have been
shown in other cancer types to play a role in apoptosis
(38,42,43). Similarly, miR-137 has been shown to be
involved in proliferation and neuronal differentiation
in vitro (44). Indeed, both CDK6 and MITF, the
experimentally verified targets from that study, were also
predicted in our network.

A thorough literature search on all of the predicted
interactions for glioblastoma is not possible here, but we
have demonstrated that mirConnX is useful for identifying hub
genes, their regulators and targets, and the pathways
involved in disease. It could potentially be
a powerful tool for clinical scientists to compile a list of top
candidate genes and form hypotheses.



CONCLUSIONS 

In recent years, with the availability of condition-specific
high-throughput mRNA and miRNA expression data,
there is an increasing need for an integrated environment
that combines data analysis and visualization to construct
hypothesized networks. While many methods exist for
network generation that use only expression data, only
binding affinity experiments such as ChIP-chip, or only
manually curated data from expert knowledge databases,
an integrated approach that maximally exploits
information from these domains is lacking. Additionally, there
have not been many attempts to incorporate both TF and
miRNA regulations, yet it has become increasingly clear
that miRNAs play a crucial role in human diseases.
mirConnX is a novel web tool developed specifically to
fill this niche. The utility of mirConnX lies in its ability
to integrate user-supplied data with pre-compiled
information on miRNA targeting and TF binding, and to
generate a network that reflects characteristics specific to
the data, guided by some prior beliefs. The user-friendly
display of interaction networks and other downstream
analyses also provides an integrated environment for
clinical researchers to perform further investigation and
exploration.